Overview

Dataset statistics

Number of variables: 11
Number of observations: 900
Missing cells: 881
Missing cells (%): 8.9%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 212.6 KiB
Average record size in memory: 241.9 B
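
The derived figures in this overview follow from the raw counts. The sketch below recomputes them from the numbers reported here; the variable names are illustrative, and it assumes that "Missing cells (%)" is taken over all cells (variables times observations) and that 1 KiB equals 1024 bytes.

```python
# Figures copied from the dataset statistics above.
n_variables = 11
n_observations = 900
missing_cells = 881
total_kib = 212.6

# Per-variable missing counts reported further down in this report
# (Birthday_year, Medical_Tent, City).
missing_per_variable = {"Birthday_year": 177, "Medical_Tent": 702, "City": 2}

total_cells = n_variables * n_observations
missing_pct = 100 * missing_cells / total_cells
avg_record_bytes = total_kib * 1024 / n_observations

print(f"Missing cells (%): {missing_pct:.1f}%")
print(f"Average record size: {avg_record_bytes:.1f} B")
print("Per-variable missing counts add up:",
      sum(missing_per_variable.values()) == missing_cells)
```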

Variable types

NUM: 6
CAT: 4
BOOL: 1

Reproduction

Analysis started: 2020-05-31 22:42:30.847431
Analysis finished: 2020-05-31 22:42:51.432862
Version: pandas-profiling v2.6.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

Name has a high cardinality: 899 distinct values (High cardinality)
Birthday_year has 177 (19.7%) missing values (Missing)
Medical_Tent has 702 (78.0%) missing values (Missing)
Family_Case_ID is highly skewed (γ1 = 26.33381889) (Skewed)
Parents or siblings infected has 685 (76.1%) zeros (Zeros)
Wife/Husband or children infected has 614 (68.2%) zeros (Zeros)
Medical_Expenses_Family has 15 (1.7%) zeros (Zeros)

Variables

Patient_ID
Real number (ℝ≥0)

UNIFORM
UNIQUE
Distinct count: 900
Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 450.5
Minimum: 1
Maximum: 900
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.2 KiB

Quantile statistics

Minimum: 1
5-th percentile: 45.95
Q1: 225.75
median: 450.5
Q3: 675.25
95-th percentile: 855.05
Maximum: 900
Range: 899
Interquartile range (IQR): 449.5

Descriptive statistics

Standard deviation: 259.9519186
Coefficient of variation (CV): 0.5770297861
Kurtosis: -1.2
Mean: 450.5
Median Absolute Deviation (MAD): 225
Skewness: 0
Sum: 405450
Variance: 67575
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 1. 900.], "bayesian blocks" binning strategy used)
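
The Patient_ID statistics above are what one would expect from a plain running counter. The following sketch assumes the column holds the consecutive integers 1 to 900 (consistent with 900 distinct values, minimum 1, maximum 900 and sum 405450) and that quantiles use linear interpolation between order statistics; the helper `quantile` is written here for illustration and is not taken from the report's tooling.

```python
import math
import statistics


def quantile(sorted_values, q):
    """Quantile with linear interpolation between neighbouring sorted values."""
    pos = q * (len(sorted_values) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])


# Assumption: Patient_ID is the sequence 1..900.
ids = list(range(1, 901))

mean = statistics.mean(ids)
variance = statistics.variance(ids)   # sample variance (n - 1 in the denominator)
std = math.sqrt(variance)
p5 = quantile(ids, 0.05)
q1 = quantile(ids, 0.25)
q3 = quantile(ids, 0.75)

print(sum(ids), mean, variance, round(std, 7))
print(round(p5, 2), round(q1, 2), round(q3, 2), round(q3 - q1, 2))
```

Under these assumptions the recomputed values agree with the mean, variance, standard deviation, percentiles and IQR listed above.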
Common values

Value | Count | Frequency (%)
900 | 1 | 0.1%
337 | 1 | 0.1%
307 | 1 | 0.1%
306 | 1 | 0.1%
305 | 1 | 0.1%
304 | 1 | 0.1%
303 | 1 | 0.1%
302 | 1 | 0.1%
301 | 1 | 0.1%
300 | 1 | 0.1%
Other values (890) | 890 | 98.9%

Minimum 5 values

Value | Count | Frequency (%)
1 | 1 | 0.1%
2 | 1 | 0.1%
3 | 1 | 0.1%
4 | 1 | 0.1%
5 | 1 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
900 | 1 | 0.1%
899 | 1 | 0.1%
898 | 1 | 0.1%
897 | 1 | 0.1%
896 | 1 | 0.1%

Family_Case_ID
Real number (ℝ≥0)

SKEWED
Distinct count: 675
Unique (%): 75.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 14305.827777777778
Minimum: 345
Maximum: 742836
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.2 KiB

Quantile statistics

Minimum: 345
5-th percentile: 2741
Q1: 8203
median: 13593.5
Q3: 18906.5
95-th percentile: 23178.25
Maximum: 742836
Range: 742491
Interquartile range (IQR): 10703.5

Descriptive statistics

Standard deviation: 25418.1539
Coefficient of variation (CV): 1.776769181
Kurtosis: 753.1134642
Mean: 14305.82778
Median Absolute Deviation (MAD): 6456.299667
Skewness: 26.33381889
Sum: 12875245
Variance: 646082547.7
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[3.45000e+02 1.78350e+03 1.02325e+04 1.02675e+04 2.11850e+04 2.12155e+04 2.44955e+04 7.42836e+05], "bayesian blocks" binning strategy used)
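
Several of the Family_Case_ID figures above are simple functions of others, which makes them easy to cross-check. The sketch below uses only numbers printed in this report and assumes the usual definitions: range is maximum minus minimum, IQR is Q3 minus Q1, and CV is the standard deviation divided by the mean. The last line is an illustrative extra, not a statistic from the report: it expresses how far the maximum lies above Q3 in units of the IQR, which gives a feel for why the column is flagged as skewed.

```python
# Values copied from the Family_Case_ID statistics above.
minimum, maximum = 345, 742836
q1, q3 = 8203, 18906.5
mean, std = 14305.82778, 25418.1539

value_range = maximum - minimum          # reported as 742491
iqr = q3 - q1                            # reported as 10703.5
cv = std / mean                          # reported as 1.776769181

# Illustrative only: distance of the maximum above Q3, measured in IQRs.
iqrs_above_q3 = (maximum - q3) / iqr

print(value_range, iqr, round(cv, 6), round(iqrs_above_q3, 1))
```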
Common values

Value | Count | Frequency (%)
14502 | 7 | 0.8%
18593 | 7 | 0.8%
20586 | 7 | 0.8%
16969 | 6 | 0.7%
10262 | 6 | 0.7%
23426 | 6 | 0.7%
9819 | 5 | 0.6%
21188 | 5 | 0.6%
4680 | 5 | 0.6%
21207 | 4 | 0.4%
Other values (665) | 842 | 93.6%

Minimum 5 values

Value | Count | Frequency (%)
345 | 1 | 0.1%
981 | 1 | 0.1%
1773 | 1 | 0.1%
1794 | 1 | 0.1%
1816 | 1 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
742836 | 1 | 0.1%
125421 | 1 | 0.1%
24520 | 2 | 0.2%
24471 | 1 | 0.1%
24454 | 2 | 0.2%

Severity
Categorical

Distinct count: 3
Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory size: 7.2 KiB

Common values

Value | Count | Frequency (%)
3 | 498 | 55.3%
1 | 216 | 24.0%
2 | 186 | 20.7%
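
The frequency column for Severity is each count divided by the number of observations. A minimal sketch, using the counts reported above and assuming percentages are rounded to one decimal place:

```python
# Counts per Severity level, copied from the table above.
counts = {"3": 498, "1": 216, "2": 186}

total = sum(counts.values())

# Share of each level, formatted to one decimal place.
shares = {value: f"{100 * count / total:.1f}%" for value, count in counts.items()}

for value, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(value, count, shares[value])
```

The counts add up to the 900 observations, which matches the zero missing values reported for this variable.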
 

Length

Max length: 1
Mean length: 1
Min length: 1

Unicode categories

Value | Count | Frequency (%)
Decimal_Number | 3 | 100.0%

Unicode scripts

Value | Count | Frequency (%)
Common | 3 | 100.0%

Unicode blocks

Value | Count | Frequency (%)
ASCII | 3 | 100.0%

Name
Categorical

HIGH CARDINALITY
UNIFORM
Distinct count: 899
Unique (%): 99.9%
Missing: 0
Missing (%): 0.0%
Memory size: 7.2 KiB

Common values

Value | Count | Frequency (%)
Mr. Samuel Darnell | 2 | 0.2%
Mr. Ellis Bennie | 1 | 0.1%
Mr. Terence Lester | 1 | 0.1%
Miss Annie Hannah | 1 | 0.1%
Master Johnnie Frederick | 1 | 0.1%
Mr. Clay Danny | 1 | 0.1%
Mr. Norman Wayne | 1 | 0.1%
Mr. Laurence Salvador | 1 | 0.1%
Ms. Natasha Glenda | 1 | 0.1%
Mr. Lloyd Ellis | 1 | 0.1%
Other values (889) | 889 | 98.8%

Length

Max length: 26
Mean length: 17.03111111
Min length: 11

Unicode categories

Value | Count | Frequency (%)
Lowercase_Letter | 26 | 52.0%
Uppercase_Letter | 22 | 44.0%
Space_Separator | 1 | 2.0%
Other_Punctuation | 1 | 2.0%

Unicode scripts

Value | Count | Frequency (%)
Latin | 48 | 96.0%
Common | 2 | 4.0%

Unicode blocks

Value | Count | Frequency (%)
ASCII | 50 | 100.0%

Birthday_year
Real number (ℝ≥0)

MISSING
Distinct count: 70
Unique (%): 9.7%
Missing: 177
Missing (%): 19.7%
Infinite: 0
Infinite (%): 0.0%
Mean: 1990.2669432918397
Minimum: 1940.0
Maximum: 2019.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.2 KiB

Quantile statistics

Minimum: 1940
5-th percentile: 1964
Q1: 1982
median: 1992
Q3: 1999.5
95-th percentile: 2016
Maximum: 2019
Range: 79
Interquartile range (IQR): 17.5

Descriptive statistics

Standard deviation: 14.52333493
Coefficient of variation (CV): 0.007297179396
Kurtosis: 0.1772975931
Mean: 1990.266943
Median Absolute Deviation (MAD): 11.31994207
Skewness: -0.3969539685
Sum: 1438963
Variance: 210.9272575
Histogram with fixed size bins (bins=10)
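
Because Birthday_year has missing values, it helps to see which denominator each figure uses. The sketch below works only from the numbers reported above and suggests that "Missing (%)" is relative to all 900 observations, while "Unique (%)" and the mean are relative to the 723 non-missing values. This reading is inferred from the arithmetic and is not documented in the report itself.

```python
# Figures copied from the Birthday_year statistics above.
n_observations = 900
n_missing = 177
n_distinct = 70
total_sum = 1438963

n_present = n_observations - n_missing            # non-missing values

missing_pct = 100 * n_missing / n_observations    # share of all rows
unique_pct = 100 * n_distinct / n_present         # share of non-missing rows
mean = total_sum / n_present                      # mean over non-missing rows

print(n_present, f"{missing_pct:.1f}%", f"{unique_pct:.1f}%", round(mean, 6))
```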
Common values

Value | Count | Frequency (%)
1996 | 31 | 3.4%
1998 | 28 | 3.1%
2002 | 27 | 3.0%
1990 | 26 | 2.9%
1999 | 25 | 2.8%
2001 | 25 | 2.8%
1992 | 25 | 2.8%
1995 | 24 | 2.7%
1984 | 22 | 2.4%
1991 | 22 | 2.4%
Other values (60) | 468 | 52.0%
(Missing) | 177 | 19.7%

Minimum 5 values

Value | Count | Frequency (%)
1940 | 1 | 0.1%
1946 | 1 | 0.1%
1949 | 3 | 0.3%
1950 | 2 | 0.2%
1954 | 1 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
2019 | 14 | 1.6%
2018 | 10 | 1.1%
2017 | 6 | 0.7%
2016 | 10 | 1.1%
2015 | 4 | 0.4%

Parents or siblings infected
Real number (ℝ≥0)

ZEROS
Distinct count: 7
Unique (%): 0.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.38
Minimum: 0
Maximum: 6
Zeros: 685
Zeros (%): 76.1%
Memory size: 7.2 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0
Q3: 0
95-th percentile: 2
Maximum: 6
Range: 6
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 0.8032470256
Coefficient of variation (CV): 2.113807962
Kurtosis: 9.850572357
Mean: 0.38
Median Absolute Deviation (MAD): 0.5784444444
Skewness: 2.756344643
Sum: 342
Variance: 0.6452057842
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[0. 0.5 1.5 2.5 6. ], "bayesian blocks" binning strategy used)
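Common values

Value | Count | Frequency (%)
0 | 685 | 76.1%
1 | 120 | 13.3%
2 | 80 | 8.9%
5 | 5 | 0.6%
3 | 5 | 0.6%
4 | 4 | 0.4%
6 | 1 | 0.1%

Minimum 5 values

Value | Count | Frequency (%)
0 | 685 | 76.1%
1 | 120 | 13.3%
2 | 80 | 8.9%
3 | 5 | 0.6%
4 | 4 | 0.4%

Maximum 5 values

Value | Count | Frequency (%)
6 | 1 | 0.1%
5 | 5 | 0.6%
4 | 4 | 0.4%
3 | 5 | 0.6%
2 | 80 | 8.9%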

Wife/Husband or children infected
Real number (ℝ≥0)

ZEROS
Distinct count: 7
Unique (%): 0.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.5211111111111111
Minimum: 0
Maximum: 8
Zeros: 614
Zeros (%): 68.2%
Memory size: 7.2 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0
Q3: 1
95-th percentile: 3
Maximum: 8
Range: 8
Interquartile range (IQR): 1

Descriptive statistics

Standard deviation: 1.09838535
Coefficient of variation (CV): 2.107775725
Kurtosis: 18.02632118
Mean: 0.5211111111
Median Absolute Deviation (MAD): 0.7110271605
Skewness: 3.706736663
Sum: 469
Variance: 1.206450377
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[0. 0.5 1.5 4.5 8. ], "bayesian blocks" binning strategy used)
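Common values

Value | Count | Frequency (%)
0 | 614 | 68.2%
1 | 212 | 23.6%
2 | 28 | 3.1%
4 | 18 | 2.0%
3 | 16 | 1.8%
8 | 7 | 0.8%
5 | 5 | 0.6%

Minimum 5 values

Value | Count | Frequency (%)
0 | 614 | 68.2%
1 | 212 | 23.6%
2 | 28 | 3.1%
3 | 16 | 1.8%
4 | 18 | 2.0%

Maximum 5 values

Value | Count | Frequency (%)
8 | 7 | 0.8%
5 | 5 | 0.6%
4 | 18 | 2.0%
3 | 16 | 1.8%
2 | 28 | 3.1%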

Medical_Expenses_Family
Real number (ℝ≥0)

ZEROS
Distinct count: 218
Unique (%): 24.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 895.7433333333333
Minimum: 0
Maximum: 14345
Zeros: 15
Zeros (%): 1.7%
Memory size: 7.2 KiB

Quantile statistics

Minimum: 0
5-th percentile: 202
Q1: 221
median: 405
Q3: 861.25
95-th percentile: 3108.35
Maximum: 14345
Range: 14345
Interquartile range (IQR): 640.25

Descriptive statistics

Standard deviation: 1385.829926
Coefficient of variation (CV): 1.547128373
Kurtosis: 33.69825684
Mean: 895.7433333
Median Absolute Deviation (MAD): 783.4914593
Skewness: 4.80874505
Sum: 806169
Variance: 1920524.585
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 0. 56. 177.5 196.5 201. ... 1609.5 2584. 4456.5 7355.5 14345. ], "bayesian blocks" binning strategy used)
Common values

Value | Count | Frequency (%)
221 | 44 | 4.9%
225 | 44 | 4.9%
364 | 42 | 4.7%
217 | 41 | 4.6%
728 | 31 | 3.4%
202 | 28 | 3.1%
294 | 25 | 2.8%
218 | 24 | 2.7%
222 | 18 | 2.0%
0 | 15 | 1.7%
Other values (208) | 588 | 65.3%

Minimum 5 values

Value | Count | Frequency (%)
0 | 15 | 1.7%
112 | 1 | 0.1%
140 | 1 | 0.1%
175 | 1 | 0.1%
180 | 1 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
14345 | 3 | 0.3%
7364 | 4 | 0.4%
7347 | 2 | 0.2%
6931 | 2 | 0.2%
6371 | 4 | 0.4%

Medical_Tent
Categorical

MISSING
Distinct count: 8
Unique (%): 4.0%
Missing: 702
Missing (%): 78.0%
Memory size: 7.2 KiB

Common values

Value | Count | Frequency (%)
C | 57 | 6.3%
B | 46 | 5.1%
D | 31 | 3.4%
E | 31 | 3.4%
A | 15 | 1.7%
F | 13 | 1.4%
G | 4 | 0.4%
T | 1 | 0.1%
(Missing) | 702 | 78.0%

Length

Max length: 3
Mean length: 2.56
Min length: 1

Unicode categories

Value | Count | Frequency (%)
Uppercase_Letter | 8 | 80.0%
Lowercase_Letter | 2 | 20.0%

Unicode scripts

Value | Count | Frequency (%)
Latin | 10 | 100.0%

Unicode blocks

Value | Count | Frequency (%)
ASCII | 10 | 100.0%

City
Categorical

Distinct count: 3
Unique (%): 0.3%
Missing: 2
Missing (%): 0.2%
Memory size: 7.2 KiB

Common values

Value | Count | Frequency (%)
Santa Fe | 649 | 72.1%
Albuquerque | 169 | 18.8%
Taos | 80 | 8.9%
(Missing) | 2 | 0.2%

Length

Max length: 11
Mean length: 8.196666667
Min length: 3

Unicode categories

Value | Count | Frequency (%)
Lowercase_Letter | 11 | 68.8%
Uppercase_Letter | 4 | 25.0%
Space_Separator | 1 | 6.2%

Unicode scripts

Value | Count | Frequency (%)
Latin | 15 | 93.8%
Common | 1 | 6.2%

Unicode blocks

Value | Count | Frequency (%)
ASCII | 16 | 100.0%

Deceased
Boolean

Distinct count: 2
Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 7.2 KiB

Common values

Value | Count | Frequency (%)
1 | 553 | 61.4%
0 | 347 | 38.6%

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, which implies that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
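
The calculation described above can be sketched in a few lines of plain Python. The function name `pearson_r` and the toy data are illustrative and do not come from the profiled dataset; the normalising factors of the covariance and the standard deviations cancel, so sums of deviations are used directly.

```python
import math


def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data: two exact linear relationships with different slopes and offsets.
xs = [1, 2, 3, 4, 5]
steep_up = [2 * x + 1 for x in xs]
shallow_down = [-0.5 * x + 3 for x in xs]

r_up = pearson_r(xs, steep_up)
r_down = pearson_r(xs, shallow_down)
print(r_up, r_down)
```

Both lines give |r| = 1 regardless of slope or offset, which illustrates the invariance under changes in location and scale; only the sign of the slope changes the sign of r.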

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
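
As a sketch of this definition, the code below converts each variable to ranks and then applies the Pearson formula to the ranks. The helper names and the toy data are illustrative; ties are assumed to receive the average of the ranks they span, which is a common convention but is not stated in the text above.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(xs, ys):
    """Pearson correlation of the rank variables."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Toy data: a monotonic but nonlinear relationship (y = x cubed).
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]

rho = spearman_rho(xs, ys)
r = pearson_r(xs, ys)
print(rho, r)
```

On this toy data ρ reaches 1 because the relationship is perfectly monotonic, while Pearson's r stays below 1 because the relationship is not linear.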

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is the difference between the number of concordant pairs and the number of discordant pairs, divided by the total number of pairs.
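
The pair counting can be sketched directly. The function below implements the formula exactly as worded above (often called tau-a); the function name and toy data are illustrative. Tie-adjusted variants such as tau-b use a different denominator and can give different values when ties are present, so this sketch should not be read as the exact computation behind the report.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1      # both variables move in the same direction
        elif direction < 0:
            discordant += 1      # the variables move in opposite directions
        # pairs tied in x or y count as neither
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy data: one swapped pair out of six possible pairs.
xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]

tau = kendall_tau_a(xs, ys)
print(tau)
```

Here five pairs are concordant and one is discordant, so τ is (5 - 1) / 6.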

Missing values

Sample

First rows

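(index) | Patient_ID | Family_Case_ID | Severity | Name | Birthday_year | Parents or siblings infected | Wife/Husband or children infected | Medical_Expenses_Family | Medical_Tent | City | Deceased
0 | 1 | 4696 | 3 | Miss Linda Betty | NaN | 0 | 0 | 225 | NaN | Santa Fe | 1
1 | 2 | 21436 | 1 | Ms. Ramona Elvira | 1966.0 | 0 | 1 | 1663 | NaN | Albuquerque | 0
2 | 3 | 7273 | 3 | Mr. Mario Vernon | 1982.0 | 0 | 0 | 221 | NaN | Santa Fe | 1
3 | 4 | 8226 | 3 | Mr. Hector Joe | 1997.0 | 0 | 0 | 220 | NaN | Santa Fe | 1
4 | 5 | 19689 | 3 | Ms. Jennie Debra | 1994.0 | 0 | 0 | 222 | NaN | Santa Fe | 0
5 | 6 | 17598 | 2 | Master Terrell Bob | NaN | 0 | 0 | 0 | NaN | Santa Fe | 1
6 | 7 | 7563 | 3 | Mr. Kristopher Francis | 1984.0 | 0 | 1 | 435 | NaN | Santa Fe | 1
7 | 8 | 9520 | 2 | Mr. Lorenzo Bennie | 1989.0 | 0 | 0 | 364 | NaN | Santa Fe | 0
8 | 9 | 6314 | 3 | Mr. Rickey Dennis | 2000.0 | 1 | 1 | 441 | NaN | Albuquerque | 0
9 | 10 | 14392 | 3 | Miss Elena Cathy | NaN | 1 | 1 | 626 | F | Albuquerque | 0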

Last rows

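(index) | Patient_ID | Family_Case_ID | Severity | Name | Birthday_year | Parents or siblings infected | Wife/Husband or children infected | Medical_Expenses_Family | Medical_Tent | City | Deceased
890 | 891 | 1907 | 1 | Mr. Ronnie Hugo | 1992.0 | 0 | 0 | 743 | C | Santa Fe | 0
891 | 892 | 742836 | 3 | Mr. Dawson Beil | 1985.0 | 0 | 0 | 219 | NaN | Taos | 1
892 | 893 | 125421 | 3 | Ms. Shayan Meyer | 1973.0 | 0 | 1 | 196 | NaN | Santa Fe | 0
893 | 894 | 345 | 2 | Mr. Adam Donovan | 1958.0 | 0 | 0 | 271 | NaN | Taos | 1
894 | 895 | 9846 | 3 | Mr. Noel Mcdougall | 1993.0 | 0 | 0 | 243 | NaN | Santa Fe | 1
895 | 896 | 6253 | 3 | Ms. Linda Wilcox | 1998.0 | 1 | 1 | 344 | NaN | Santa Fe | 0
896 | 897 | 6483 | 3 | Mr. Haiden Vance | 2006.0 | 0 | 0 | 258 | NaN | Santa Fe | 0
897 | 898 | 981 | 3 | Miss Anaiya Love | 1990.0 | 0 | 0 | 214 | NaN | Taos | 1
898 | 899 | 16418 | 2 | Mr. Robert Williams | 1994.0 | 1 | 1 | 812 | NaN | Santa Fe | 0
899 | 900 | 3782 | 3 | Ms. Marjorie Hays | 2002.0 | 0 | 0 | 202 | C | Albuquerque | 0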